The data set in this exploratory data analysis contains 1599 red wine samples. Each sample has a quality rating and 11 physicochemical attributes; together with the index column ‘X’ and the derived ratio ‘citric_over_fixed.acidity’ introduced below, the data frame has 14 columns.
## [1] 1599 14
The name and type of each variable are shown below:
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide : num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ citric_over_fixed.acidity: num 0 0 0.00513 0.05 0 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality citric_over_fixed.acidity
## Min. : 8.40 Min. :3.000 Min. :0.00000
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:0.01292
## Median :10.20 Median :6.000 Median :0.03291
## Mean :10.42 Mean :5.636 Mean :0.03084
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:0.04503
## Max. :14.90 Max. :8.000 Max. :0.13929
We would like to explore how the red wine samples are distributed over each variable by plotting histograms. First, we look at the four variables that are linked to acidity: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’ and ‘pH’. We also want to know whether the share of citric acid in the fixed acids matters to red wine quality, so we add one more variable, ‘citric_over_fixed.acidity’, to our data frame.
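The derived variable can be reproduced as a simple column-wise division. The sketch below assumes the data frame is called `redwine` (the name that appears in the regression calls later in this report) and uses only the first four rows printed by `str()` above, so it is an illustration rather than the full data.

```r
# First four rows of two columns, copied from the str() output in this report.
redwine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2),
  citric.acid   = c(0, 0, 0.04, 0.56)
)

# Share of citric acid within the fixed acids.
redwine$citric_over_fixed.acidity <- redwine$citric.acid / redwine$fixed.acidity

print(redwine)
```

The third and fourth values come out near 0.00513 and 0.05, which matches the `citric_over_fixed.acidity` column in the `str()` output.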
From the four figures above, we observe that:
Most red wine samples have a ‘moderate’ acidity: ‘fixed.acidity’ is concentrated around 7 to 8 (interquartile range 7.1 to 9.2), and ‘volatile.acidity’ mostly falls between 0.3 and 0.7.
A sizable share of the red wine samples contain no citric acid (‘citric.acid’ equal to 0; the first quartile is only 0.09). We will explore whether citric acid is associated with noticeably better red wine quality.
Next, we conduct a preliminary exploration of the remaining variables.
We have analyzed the remaining variables above. The figures indicate:
Most of the remaining variables have long-tailed distributions; the exceptions are ‘quality’ and ‘density’, which are already approximately normal. With a log10 scale on the x-axis, the long-tailed distributions look close to normal. The one exception is ‘alcohol’, which still looks long-tailed even on a log10 scale.
‘quality’ is approximately normally distributed, and most of the red wines are rated 5 or 6.
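To illustrate why a log10 scale helps with a long-tailed variable, the sketch below compares a simple skewness measure before and after the transform. It uses only the first ten `residual.sugar` values printed by `str()` above, and `skewness()` is a hypothetical helper written for this example. The report itself rescales the x-axis of the histograms; transforming the values, as done here, is an assumption made to keep the example self-contained.

```r
# First ten residual.sugar values, copied from the str() output in this report.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

raw_skew <- skewness(residual.sugar)
log_skew <- skewness(log10(residual.sugar))
cat("skewness, raw scale:  ", round(raw_skew, 2), "\n")
cat("skewness, log10 scale:", round(log_skew, 2), "\n")

# Histogram bin counts on the log10 scale, computed without drawing the plot.
log_hist <- hist(log10(residual.sugar), plot = FALSE)
print(log_hist$counts)
```

On this small sample the log10 values are less right-skewed than the raw values, which is the effect the transformed histograms rely on.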
This dataset has 1599 observations and 13 original variables, including the index ‘X’. Apart from the index, the variables can be divided into 4 groups:
variables linked to acids: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘pH’
variables linked to other main components: ‘residual.sugar’, ‘alcohol’, ‘density’
variables linked to additives: ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’
main variable: ‘quality’
Here are some main conclusions from the above plots:
The volatile acidity varies mostly from 0.1 to 1.0, with a few outliers above 1.0.
A significant proportion of the red wine samples contain no citric acid. The citric acid level is potentially strongly related to the pH value.
The ‘residual.sugar’ distribution has a high peak around 2 to 3, and the ‘alcohol’ distribution varies mostly from 9 to 14 with a high peak around 9.5.
The distributions of ‘chlorides’ and ‘sulphates’ have relatively wide ranges.
In this analysis, the main interests would be ‘quality’ and ‘pH’.
Other variables such as ‘citric.acid’, ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’ would also be important for a predictive model for ‘quality’.
‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’ would be helpful to investigate ‘pH’.
Yes. The new variable ‘citric_over_fixed.acidity’ is the ratio of ‘citric.acid’ to ‘fixed.acidity’.
Many variables, such as ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’ and ‘sulphates’, have long-tailed distributions. We transformed the x-axis to a log10 scale to make the distributions more symmetric.
Here we investigate the relationship between each pair of variables using the correlation matrix below.
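A correlation matrix of this kind can be computed with `cor()`. The sketch below is only an illustration on the first ten rows printed by `str()` above, for four of the columns; it assumes Pearson correlation (the default of `cor()`), since the report does not state which coefficient was used, so its numbers will not match the full-data matrix.

```r
# First ten rows of four columns, copied from the str() output in this report.
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise correlations (Pearson is an assumption; it is the default of cor()).
cor_matrix <- cor(redwine)
print(round(cor_matrix, 2))

# Correlations of each attribute with quality, ranked by absolute value.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
print(round(quality_cor[order(abs(quality_cor), decreasing = TRUE)], 2))
```

Ranking by absolute value is one way to pick out the attributes worth plotting against ‘quality’ first.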
From above, we can draw conclusions:
Now we start to explore the relationships between pairs of variables. First, we use scatter plots to analyze two variables at a time.
From the scatter plots above, we can see that only ‘alcohol’ and ‘volatile.acidity’ have a noticeable linear correlation with ‘quality’.
Next, we explore the other pairs of variables with high correlation coefficients, excluding the pairs that contain ‘quality’.
It seems that ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’ have a linear relationship.
##
## Call:
## lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide, data = subset(redwine,
## total.sulfur.dioxide < quantile(redwine$total.sulfur.dioxide,
## 0.9)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4478 -3.5173 -0.9483 3.2228 28.1375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.211628 0.364571 8.809 <2e-16 ***
## total.sulfur.dioxide 0.296045 0.008251 35.881 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.843 on 1437 degrees of freedom
## Multiple R-squared: 0.4726, Adjusted R-squared: 0.4722
## F-statistic: 1287 on 1 and 1437 DF, p-value: < 2.2e-16
After dropping the samples with the largest 10% of ‘total.sulfur.dioxide’ values, ‘total.sulfur.dioxide’ explains around 47% of the variance of ‘free.sulfur.dioxide’ (R-squared of 0.4726).
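The trimming step in the call above (keep only rows below the 0.9 quantile of the predictor, then fit) can be sketched as follows. The formula and the `subset()`/`quantile()` pattern are taken from the printed call; the data here are only the first ten rows from the `str()` output, so the resulting R-squared is illustrative and will differ from the 0.4726 reported for the full data.

```r
# First ten rows of the two sulfur dioxide columns, copied from the str() output.
redwine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Drop the rows at or above the 0.9 quantile of the predictor.
cutoff  <- quantile(redwine$total.sulfur.dioxide, 0.9)
trimmed <- redwine[redwine$total.sulfur.dioxide < cutoff, ]

# Simple linear regression on the trimmed data.
fit       <- lm(free.sulfur.dioxide ~ total.sulfur.dioxide, data = trimmed)
r_squared <- summary(fit)$r.squared

cat("rows kept:", nrow(trimmed), "of", nrow(redwine), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

Trimming on the predictor only removes the long right tail of ‘total.sulfur.dioxide’; the response is left untouched.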
##
## Call:
## lm(formula = pH ~ fixed.acidity, data = subset(redwine, fixed.acidity <
## quantile(redwine$fixed.acidity, 0.9)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49613 -0.06401 0.00121 0.06727 0.49291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.916363 0.019230 203.65 <2e-16 ***
## fixed.acidity -0.073939 0.002406 -30.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1124 on 1431 degrees of freedom
## Multiple R-squared: 0.3976, Adjusted R-squared: 0.3972
## F-statistic: 944.4 on 1 and 1431 DF, p-value: < 2.2e-16
After dropping the samples with the largest 10% of ‘fixed.acidity’ values, ‘fixed.acidity’ explains around 40% of the variance of ‘pH’ (R-squared of 0.3976).
##
## Call:
## lm(formula = pH ~ citric.acid, data = subset(redwine, citric.acid <
## quantile(redwine$citric.acid, 0.9)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50024 -0.07745 -0.00594 0.08267 0.58243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427575 0.005972 573.91 <2e-16 ***
## citric.acid -0.430289 0.021020 -20.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1303 on 1437 degrees of freedom
## Multiple R-squared: 0.2258, Adjusted R-squared: 0.2252
## F-statistic: 419 on 1 and 1437 DF, p-value: < 2.2e-16
So far, ‘pH’ appears less strongly related to ‘citric.acid’ (R-squared of 0.2258) than to ‘fixed.acidity’ (R-squared of 0.3976). We also want to see how the ratio between them relates to ‘pH’.
##
## Call:
## lm(formula = pH ~ citric_over_fixed.acidity, data = subset(redwine,
## citric_over_fixed.acidity < quantile(redwine$citric_over_fixed.acidity,
## 0.9)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50040 -0.08331 -0.00459 0.08673 0.60393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42333 0.00679 504.17 <2e-16 ***
## citric_over_fixed.acidity -3.90239 0.21261 -18.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1383 on 1437 degrees of freedom
## Multiple R-squared: 0.1899, Adjusted R-squared: 0.1894
## F-statistic: 336.9 on 1 and 1437 DF, p-value: < 2.2e-16
Based on the result above, it is reasonable to infer that the ratio ‘citric_over_fixed.acidity’ is not a valuable feature for a linear regression model of ‘pH’: its R-squared of 0.1899 is lower than that of either ‘citric.acid’ (0.2258) or ‘fixed.acidity’ (0.3976) alone.
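The comparison of the three single-predictor models of ‘pH’ can be sketched as a small loop. `trimmed_r_squared()` is a hypothetical helper written for this example; it applies the same 0.9-quantile trimming as the calls above. Only the first ten rows from the `str()` output are used, so the ranking it prints is illustrative and need not match the full-data results.

```r
# First ten rows of the acidity columns, copied from the str() output.
redwine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)
redwine$citric_over_fixed.acidity <- redwine$citric.acid / redwine$fixed.acidity

# Hypothetical helper: fit response ~ predictor on the rows below the given
# quantile of the predictor and return the R-squared of that fit.
trimmed_r_squared <- function(data, response, predictor, prob = 0.9) {
  keep <- data[[predictor]] < quantile(data[[predictor]], prob)
  fit  <- lm(reformulate(predictor, response = response), data = data[keep, ])
  summary(fit)$r.squared
}

predictors <- c("fixed.acidity", "citric.acid", "citric_over_fixed.acidity")
r_squared  <- sapply(predictors, function(p) trimmed_r_squared(redwine, "pH", p))

# Predictors ranked by how much variance of pH each explains on its own.
print(round(sort(r_squared, decreasing = TRUE), 3))
```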
Second, we use box plots to show the distribution of ‘quality’ over different intervals of another variable.
The summary of ‘residual.sugar’ is listed below. We divide the variable into 3 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## redwine$residual.sugar.bucket: (0.1,1.9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.599 6.000 8.000
## --------------------------------------------------------
## redwine$residual.sugar.bucket: (1.9,2.5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.66 6.00 8.00
## --------------------------------------------------------
## redwine$residual.sugar.bucket: (2.5,16]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.637 6.000 8.000
The distribution of ‘quality’ does not change noticeably across the groups defined by ‘residual.sugar’, which suggests that ‘residual.sugar’ may not influence ‘quality’ much.
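The bucketed summaries above follow a cut-then-summarize pattern. The sketch below assumes the buckets were built with `cut()` using the break points that appear in the bucket labels (0.1, 1.9, 2.5, 16) and summarized with `by()`; it runs on the first ten rows from the `str()` output only, so the per-bucket statistics are illustrative.

```r
# First ten rows of two columns, copied from the str() output in this report.
redwine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Break points taken from the bucket labels printed above.
redwine$residual.sugar.bucket <- cut(redwine$residual.sugar,
                                     breaks = c(0.1, 1.9, 2.5, 16))

# How many samples fall into each interval.
bucket_counts <- table(redwine$residual.sugar.bucket)
print(bucket_counts)

# Summary of quality within each bucket.
print(by(redwine$quality, redwine$residual.sugar.bucket, summary))
```

With the bucket column in place, `boxplot(quality ~ residual.sugar.bucket, data = redwine)` would draw the corresponding box plot in base R.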
The summary of ‘chlorides’ is listed below. We divide the variable into 4 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## redwine$chlorides.bucket: (0,0.07]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.873 6.000 8.000
## --------------------------------------------------------
## redwine$chlorides.bucket: (0.07,0.09]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.576 6.000 8.000
## --------------------------------------------------------
## redwine$chlorides.bucket: (0.09,0.12]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.554 6.000 7.000
## --------------------------------------------------------
## redwine$chlorides.bucket: (0.12,0.7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 5.00 5.38 6.00 7.00
Based on this box plot, lower ‘chlorides’ values, from 0 to 0.07, tend to go with higher red wine quality. The first group has the highest mean ‘quality’ (5.873), and the median drops from 6 to 5 in the two groups with the highest ‘chlorides’.
The summary of ‘total.sulfur.dioxide’ is listed below. We divide the variable into 3 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## redwine$total.sulfur.dioxide.bucket: (5,22]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.748 6.000 8.000
## --------------------------------------------------------
## redwine$total.sulfur.dioxide.bucket: (22,62]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.709 6.000 8.000
## --------------------------------------------------------
## redwine$total.sulfur.dioxide.bucket: (62,289]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 5.000 5.000 5.374 6.000 8.000
In the box plot of ‘quality’ across the groups of ‘total.sulfur.dioxide.bucket’, the first two groups, (5,22] and (22,62], have similar quality distributions, but once ‘total.sulfur.dioxide’ falls in (62,289], the mean and median of ‘quality’ decrease. This suggests that a relatively high amount of total sulfur dioxide may be associated with lower red wine quality.
The summary of ‘citric.acid’ is listed below. We divide the variable into 3 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## redwine$citric.acid.bucket: (-0.1,0.09]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.459 6.000 8.000
## --------------------------------------------------------
## redwine$citric.acid.bucket: (0.09,0.42]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 5.6 6.0 8.0
## --------------------------------------------------------
## redwine$citric.acid.bucket: (0.42,1]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.887 6.000 8.000
The summary of ‘citric_over_fixed.acidity’ is listed below. We divide the variable into 3 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01292 0.03291 0.03084 0.04503 0.13930
## redwine$citric_over_fixed.acidity.bucket: (-0.1,0.013]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.462 6.000 8.000
## --------------------------------------------------------
## redwine$citric_over_fixed.acidity.bucket: (0.013,0.045]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.578 6.000 8.000
## --------------------------------------------------------
## redwine$citric_over_fixed.acidity.bucket: (0.045,0.14]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.928 7.000 8.000
Based on the two plots above, red wine quality tends to increase with the amount of ‘citric.acid’ (the group means rise from 5.459 to 5.887). When the ratio of ‘citric.acid’ to ‘fixed.acidity’ falls into a ‘reasonable’ interval, (0.045,0.14], the red wine tends to have a higher quality level (mean 5.928, third quartile 7).
The summary of ‘sulphates’ is listed below. We divide the variable into 3 intervals based on this summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
From this plot, we may infer that higher ‘sulphates’ values are associated with better quality.
Weak correlation:
The red wine quality index, ‘quality’, is not highly correlated with the environment attributes ‘pH’ and ‘residual.sugar’, which contradicts my intuition.
Medium correlation:
‘quality’ is moderately correlated with several acid and additive attributes:
positive correlation: ‘fixed.acidity’, ‘citric.acid’, ‘sulphates’;
negative correlation: ‘volatile.acidity’, ‘chlorides’, ‘total.sulfur.dioxide’, ‘free.sulfur.dioxide’.
Strong correlation:
‘quality’ is most strongly correlated with ‘alcohol’. ‘quality’ tends to be higher when the ‘alcohol’ level is higher.
Yes, we find that ‘pH’ is negatively related to ‘fixed.acidity’ and ‘citric.acid’, as the negative slopes in the regressions above show, and that it also has a negative correlation coefficient with ‘chlorides’.
The strongest relationship is between ‘quality’ and ‘alcohol’, which seems reasonable for red wine quality.
By the plots above, we notice that:
Figure 1: the two white dashed lines represent the 0.95 and 0.05 quantiles of quality, and the white solid line represents the median of quality. No trend stands out in this figure.
Figure 2: as the chlorides level goes up, the red wine tends to have lower quality.
Figure 3: the result is clearer here than in the previous two figures. When the chlorides level is low, the red wine may have been produced with fewer additives and hence tends to have better quality (level 6 to level 7). When chlorides increase slightly, the wine is mostly of medium quality (level 5 to level 6), which is probably the most common in daily life. With even more chlorides, high quality becomes rare; those samples might have been produced with poorer techniques.
In this section, we explored the relationship between ‘quality’ and ‘alcohol’ along with three related variables: ‘sulphates’, ‘chlorides’ and ‘total.sulfur.dioxide’. We already know from the last section that ‘quality’ is positively correlated with ‘alcohol’. In this section, we noticed that, at the same alcohol level, red wine quality tends to be better when the chlorides level is lower and the sulphates level is higher.
For ‘total.sulfur.dioxide’, we did not find a consistent pattern across subsets, but when the other additives are at a medium level, a high level of sulfur dioxide goes with poor quality. We also found that red wines with low alcohol levels use more sulfur dioxide, possibly because it contributes to the wine's anti-microbial properties.
Another interesting point is that red wine samples with high alcohol and low chlorides levels stand out with better quality. This observation suggests that, for wines with a high level of sulfur dioxide, less chlorides and more alcohol go with better quality.
This plot explores the relationship between ‘quality’ and ‘chlorides’. In the scatter plot in figure 1, we cannot find any trend, even after adding auxiliary lines for the 0.95 quantile, 0.05 quantile and median of ‘quality’ at each ‘chlorides’ value, because ‘quality’ is not continuous. In figure 2, we use box plots to show the descriptive statistics in each chlorides interval, which reveals a trend: as the chlorides level increases, red wine quality decreases. Figure 3 shows the trend more clearly.
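The auxiliary lines in figure 1 are per-value quantiles of ‘quality’. A minimal sketch of how such values could be computed is shown below; grouping by each distinct ‘chlorides’ value is an assumption based on the description above, and only the first ten rows from the `str()` output are used, so most groups contain a single sample.

```r
# First ten rows of two columns, copied from the str() output in this report.
redwine <- data.frame(
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Group quality by each distinct chlorides value.
quality_by_chlorides <- split(redwine$quality, redwine$chlorides)

# 0.05 quantile, median and 0.95 quantile of quality within each group;
# one column per chlorides value, one row per quantile.
quantile_lines <- sapply(quality_by_chlorides, quantile,
                         probs = c(0.05, 0.5, 0.95))
print(quantile_lines)
```

Because ‘quality’ takes only a few integer levels, these three curves tend to jump between neighboring levels instead of forming a smooth trend, which is consistent with the difficulty of reading figure 1.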
We know that the correlation coefficient between alcohol and quality is 0.521, which is relatively high. In this plot, we see that the data points ‘follow’ the linear regression line well.
This plot reveals different patterns of quality over alcohol in different subsets. We noticed that when the level of the other additives or of alcohol is low, most red wine samples contain a relatively large amount of sulfur dioxide.
This exploratory data analysis examines a dataset describing the ingredients of red wine samples. The dataset has 1599 observations and 12 variables, not counting the index. In the initial phase, I explored each variable through univariate analysis. In this preliminary exploration, I noticed that some of the variables have long-tailed distributions, so I transformed the x-axis to a log10 scale and verified that the resulting distributions are bell-shaped and symmetric. I also noticed that quite a few red wine samples have a ‘citric.acid’ value of 0. This made me wonder whether citric acid is a good additive in red wine, and what the effect of the other additives is. After looking at the correlation matrix, I decided to focus on the variables whose correlation coefficient with ‘quality’ is higher than 0.2 and to explore them in pairs. Next, I compared those pairs together with other variables.
The main findings can be summarized as follows. The variables I am interested in are ‘quality’, ‘alcohol’, ‘sulphates’, ‘total.sulfur.dioxide’ and ‘chlorides’. The variable most correlated with ‘quality’ is ‘alcohol’, with a correlation coefficient of 0.521. The most interesting observation is that the total sulfur dioxide level appears to play a critical role in red wine quality. I noticed that red wines with a high total sulfur dioxide level do not need much chlorides to reach high quality when the alcohol level is high. In addition, when the other additives are not at a high level, many red wine samples contain a relatively large amount of sulfur dioxide and reach a medium or medium-to-high quality level. The sulphates and chlorides levels also relate to red wine quality, positively and negatively respectively.